Use SQLAlchemy with the UCSC Database

We are going to use SQLAlchemy first to pull the list of available tables, and then to build an ORM class for the table we care about, in this case snp147.

To do this we will be making use of SQLAlchemy's automapping functionality, which creates ORM classes directly from tables. Unfortunately, it fails for a number of UCSC tables because they don't have detectable primary keys, which the ORM requires. Because of this we will first inspect the database and dump a list of tables, and then explicitly declare a class for the table we care about. SQLAlchemy will fill in all the extra columns for us, so we only need to declare the class and its primary key; everything else is done for us.


In [1]:
import pandas as pd
from sqlalchemy import inspect
from sqlalchemy import create_engine
from sqlalchemy import Column, Integer, String
from sqlalchemy.orm import Session
from sqlalchemy.sql import select
from sqlalchemy.ext.automap import automap_base

In [2]:
# Connect to the hg19 database
engine = create_engine("mysql+pymysql://genome@genome-mysql.cse.ucsc.edu/{organism}?charset=utf8mb4".format(organism='hg19'))
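
The same URL template works for other assemblies (just swap the organism, e.g. 'hg38'), and it can be worth a quick connectivity check before reflecting anything. A minimal sketch, not part of the original session:

from sqlalchemy import text

# Confirm the public UCSC server is reachable before doing any reflection
with engine.connect() as conn:
    print(conn.execute(text("SELECT 1")).scalar())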

Get the list of tables


In [3]:
# Get the list of tables
inspector = inspect(engine)
tables = inspector.get_table_names()

This list is very long. You could just read through it, but instead we are going to filter it for tables whose names begin with 'snp'.


In [4]:
len(tables)


Out[4]:
11048

In [5]:
snp_tables = [i for i in tables if i.startswith('snp')]

In [6]:
sorted(snp_tables)


Out[6]:
['snp138',
 'snp138CodingDbSnp',
 'snp138Common',
 'snp138ExceptionDesc',
 'snp138Flagged',
 'snp138Mult',
 'snp138OrthoPt4Pa2Rm3',
 'snp138Seq',
 'snp141',
 'snp141CodingDbSnp',
 'snp141Common',
 'snp141ExceptionDesc',
 'snp141Flagged',
 'snp141OrthoPt4Pa2Rm3',
 'snp141Seq',
 'snp142',
 'snp142CodingDbSnp',
 'snp142Common',
 'snp142ExceptionDesc',
 'snp142Flagged',
 'snp142Mult',
 'snp142OrthoPt4Pa2Rm3',
 'snp142Seq',
 'snp144',
 'snp144CodingDbSnp',
 'snp144Common',
 'snp144ExceptionDesc',
 'snp144Flagged',
 'snp144Mult',
 'snp144OrthoPt4Pa2Rm3',
 'snp144Seq',
 'snp146',
 'snp146CodingDbSnp',
 'snp146Common',
 'snp146ExceptionDesc',
 'snp146Flagged',
 'snp146Mult',
 'snp146OrthoPt4Pa2Rm3',
 'snp146Seq',
 'snp147',
 'snp147CodingDbSnp',
 'snp147Common',
 'snp147ExceptionDesc',
 'snp147Flagged',
 'snp147Mult',
 'snp147OrthoPt4Pa2Rm3',
 'snp147Seq',
 'snpArrayAffy250Nsp',
 'snpArrayAffy250Sty',
 'snpArrayAffy5',
 'snpArrayAffy6',
 'snpArrayAffy6SV',
 'snpArrayIllumina1M',
 'snpArrayIllumina1MRaw',
 'snpArrayIllumina300',
 'snpArrayIllumina550',
 'snpArrayIllumina650',
 'snpArrayIlluminaHuman660W_Quad',
 'snpArrayIlluminaHuman660W_QuadRaw',
 'snpArrayIlluminaHumanCytoSNP_12',
 'snpArrayIlluminaHumanCytoSNP_12Raw',
 'snpArrayIlluminaHumanOmni1_Quad',
 'snpArrayIlluminaHumanOmni1_QuadRaw']

OK, so let's use snp147.

Get Columns and Build ORM

We are now going to build the ORM class, so let's check the actual columns of the table.


In [7]:
inspector.get_columns('snp147')


Out[7]:
[{'autoincrement': False,
  'default': None,
  'name': 'bin',
  'nullable': False,
  'type': SMALLINT(display_width=5, unsigned=True)},
 {'default': None,
  'name': 'chrom',
  'nullable': False,
  'type': VARCHAR(length=31)},
 {'autoincrement': False,
  'default': None,
  'name': 'chromStart',
  'nullable': False,
  'type': INTEGER(display_width=10, unsigned=True)},
 {'autoincrement': False,
  'default': None,
  'name': 'chromEnd',
  'nullable': False,
  'type': INTEGER(display_width=10, unsigned=True)},
 {'default': None,
  'name': 'name',
  'nullable': False,
  'type': VARCHAR(length=15)},
 {'autoincrement': False,
  'default': None,
  'name': 'score',
  'nullable': False,
  'type': SMALLINT(display_width=5, unsigned=True)},
 {'default': None,
  'name': 'strand',
  'nullable': False,
  'type': ENUM('+', '-')},
 {'default': None, 'name': 'refNCBI', 'nullable': False, 'type': BLOB()},
 {'default': None, 'name': 'refUCSC', 'nullable': False, 'type': BLOB()},
 {'default': None,
  'name': 'observed',
  'nullable': False,
  'type': VARCHAR(length=255)},
 {'default': None,
  'name': 'molType',
  'nullable': False,
  'type': ENUM('unknown', 'genomic', 'cDNA')},
 {'default': None,
  'name': 'class',
  'nullable': False,
  'type': ENUM('single', 'in-del', 'microsatellite', 'named', 'mnp', 'insertion', 'deletion')},
 {'default': None, 'name': 'valid', 'nullable': False, 'type': SET(length=15)},
 {'default': None, 'name': 'avHet', 'nullable': False, 'type': FLOAT()},
 {'default': None, 'name': 'avHetSE', 'nullable': False, 'type': FLOAT()},
 {'default': None, 'name': 'func', 'nullable': False, 'type': SET(length=14)},
 {'default': None,
  'name': 'locType',
  'nullable': False,
  'type': ENUM('range', 'exact', 'between', 'rangeInsertion', 'rangeSubstitution', 'rangeDeletion', 'fuzzy')},
 {'autoincrement': False,
  'default': None,
  'name': 'weight',
  'nullable': False,
  'type': INTEGER(display_width=10, unsigned=True)},
 {'default': None,
  'name': 'exceptions',
  'nullable': False,
  'type': SET(length=26)},
 {'autoincrement': False,
  'default': None,
  'name': 'submitterCount',
  'nullable': False,
  'type': SMALLINT(display_width=5, unsigned=True)},
 {'default': None,
  'name': 'submitters',
  'nullable': False,
  'type': LONGBLOB()},
 {'autoincrement': False,
  'default': None,
  'name': 'alleleFreqCount',
  'nullable': False,
  'type': SMALLINT(display_width=5, unsigned=True)},
 {'default': None, 'name': 'alleles', 'nullable': False, 'type': LONGBLOB()},
 {'default': None, 'name': 'alleleNs', 'nullable': False, 'type': LONGBLOB()},
 {'default': None,
  'name': 'alleleFreqs',
  'nullable': False,
  'type': LONGBLOB()},
 {'default': None,
  'name': 'bitfields',
  'nullable': False,
  'type': SET(length=33)}]
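
If you only need the names and types, a compact listing (a small convenience, not in the original notebook) is easier to scan:

# Print just the column names and their MySQL types
for col in inspector.get_columns('snp147'):
    print(col['name'], col['type'])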

There are a lot of columns, but the vast majority will automap just fine, so we don't need to do anything with them. However, we do need to declare a primary key, since none is detected here.

I actually just want to look up the SNP name by position, so I am going to explicitly declare a primary key (I don't care which column serves as the key, and some UCSC tables have primary keys that are not detected by the automapper) and also declare the columns I care about. The mapped class will end up with all of the table's columns, but declaring these few guarantees access to the ones I actually use.


In [8]:
# The automap base will detect all tables in the database and create classes for
# as many as it can. Many UCSC tables don't become classes because the primary key
# cannot be detected.
Base = automap_base()

class SNP147(Base):
    __tablename__ = 'snp147'
    
    name = Column(String(length=15), primary_key=True, nullable=False)
    
    # The following columns do not need to be declared; the automapper will fill
    # them in for us. I declare them anyway for my own reference.
    chrom      = Column(String(length=31), nullable=False)
    chromStart = Column(Integer, nullable=False)
    chromEnd   = Column(Integer, nullable=False)

# reflect the tables
Base.prepare(engine, reflect=True)
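
Tables without a detectable primary key are silently skipped by the automapper, so it can be useful to check what actually got mapped. A quick optional peek, not in the original session:

# Names of the tables the automapper turned into classes
print(len(Base.classes.keys()))
print(sorted(Base.classes.keys())[:10])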

In [9]:
session = Session(engine)

Query the table


In [10]:
session.query(SNP147.name).filter(SNP147.chrom == 'chr1').filter(SNP147.chromEnd == 16952481).first()


Out[10]:
('rs915311')
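
Because the automapper filled in every other column, we can also pull the full mapped row and read any attribute off it. A sketch (the printed values will be whatever the table holds):

# Fetch the whole mapped row for one rsID and read a few automapped columns
snp = session.query(SNP147).filter(SNP147.name == 'rs915311').first()
print(snp.chrom, snp.chromStart, snp.chromEnd, snp.observed)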

In [11]:
positions = [
    154326279,
    11029552,
    241803636,
    59165838,
    39991588,
    204733046,
    16341354,
    16971948,
    154056834,
    9712006
]

In [12]:
q = session.query(SNP147.name).filter(SNP147.chrom == 'chr1').filter(SNP147.chromEnd.in_(positions))

In [13]:
q.all()


Out[13]:
[('rs78800005'),
 ('rs74052099'),
 ('rs1048245'),
 ('rs2209174'),
 ('rs3738676'),
 ('rs11207297'),
 ('rs72694300'),
 ('rs4040747'),
 ('rs12034642'),
 ('rs7513575')]

In [14]:
df = pd.read_sql_query(q.statement, engine)

In [15]:
df


Out[15]:
name
0 rs78800005
1 rs74052099
2 rs1048245
3 rs2209174
4 rs3738676
5 rs11207297
6 rs72694300
7 rs4040747
8 rs12034642
9 rs7513575

In [16]:
df2 = pd.read_sql_query(
    session.query(SNP147.name, SNP147.chrom, SNP147.chromEnd).filter(SNP147.chrom == 'chr1').filter(SNP147.chromEnd.in_(positions)).statement,
    engine
)

In [17]:
df2


Out[17]:
name chrom chromEnd
0 rs78800005 chr1 9712006
1 rs74052099 chr1 11029552
2 rs1048245 chr1 16341354
3 rs2209174 chr1 16971948
4 rs3738676 chr1 39991588
5 rs11207297 chr1 59165838
6 rs72694300 chr1 154056834
7 rs4040747 chr1 154326279
8 rs12034642 chr1 204733046
9 rs7513575 chr1 241803636
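
Note that the query returns rows sorted by position here, not in the order of the original positions list. If you want the rsIDs back in the input order, a small pandas step like this works (a sketch using df2 from above):

# Reindex by position so the names line up with the original positions list
lookup = df2.set_index('chromEnd')['name'].reindex(positions)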

That's it, simple! Yes, you could query the MySQL database directly, but this approach is much more robust for on-the-fly data analysis: it is easy to get raw SQL syntax wrong, whereas composing queries with SQLAlchemy in Jupyter is trivial.
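
For comparison, here is roughly what the last lookup looks like as a hand-written SQL string (a sketch; the table and column names have to be kept in sync by hand, which is exactly the bookkeeping the ORM removes):

# Equivalent raw SQL passed straight to pandas (only two positions shown)
raw_df = pd.read_sql_query(
    "SELECT name, chrom, chromEnd FROM snp147 "
    "WHERE chrom = 'chr1' AND chromEnd IN (9712006, 11029552)",
    engine,
)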